3.5 MCMC

1 MCMC

Hierarchical models can become very complex very quickly, creating serious computational headaches. For the posterior $\lambda(\theta \mid x) = \dfrac{p_\theta(x)\lambda(\theta)}{\int_\Omega p_\zeta(x)\lambda(\zeta)\,d\zeta}$, the normalizing integral is often intractable.
A computational strategy is to set up a Markov chain whose stationary distribution is proportional to $p_\theta(x)\lambda(\theta)$, and run it to obtain approximate samples from $\lambda(\theta \mid x)$.

Compared with the treatment of Markov chains in probability courses, the notation here is more Bayesian.

Markov Chain

A (stationary) Markov chain with transition kernel $Q(y \mid x)$ and initial distribution $\pi_0(x)$ is a sequence of RVs $X^{(0)}, X^{(1)}, \dots$ where $X^{(0)} \sim \pi_0$ and $X^{(t+1)} \mid X^{(0)}, \dots, X^{(t)} \sim Q(\cdot \mid X^{(t)})$, i.e. $Q(y \mid x) = P(X^{(t+1)} = y \mid X^{(t)} = x)$.

The marginal distribution of $X^{(1)}$ is $\pi_1(y) = P(X^{(1)} = y) = \int_{\mathcal{X}} Q(y \mid x)\,\pi_0(x)\,d\mu(x)$.
This is a directed graphical model: $X^{(0)} \to X^{(1)} \to X^{(2)} \to \cdots$
If $\pi(y) = \int_{\mathcal{X}} Q(y \mid x)\,\pi(x)\,d\mu(x)$, we say $\pi$ is a stationary distribution for $Q$.
A sufficient condition for stationarity is detailed balance: $\pi(x)Q(y \mid x) = \pi(y)Q(x \mid y)$ for all $x, y$.

This suffices because integrating both sides of detailed balance over $x$ gives $\pi(y) = \pi(y)\int_{\mathcal{X}} Q(x \mid y)\,d\mu(x) = \int_{\mathcal{X}} Q(y \mid x)\,\pi(x)\,d\mu(x)$.

A Markov chain satisfying detailed balance is called reversible: if $\pi_0 = \pi$, then $(X^{(0)}, \dots, X^{(t)}) \overset{d}{=} (X^{(t)}, \dots, X^{(0)})$.

Note that $P(X^{(t)} = x \mid X^{(t+1)} = y) = \dfrac{P(X^{(t)} = x)\,P(X^{(t+1)} = y \mid X^{(t)} = x)}{P(X^{(t+1)} = y)} = \dfrac{\pi(x)Q(y \mid x)}{\pi(y)} = Q(x \mid y)$.
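As a sanity check, detailed balance and the resulting stationarity can be verified numerically on a small discrete chain. The Metropolis-style construction of $Q$ below is an illustrative assumption, not something from the notes; it is just a convenient way to build a kernel with detailed balance for a chosen $\pi$:

```python
import numpy as np

# Target distribution pi on 3 states; build Q[x, y] = Q(y | x) so that
# detailed balance holds (Metropolis recipe: propose uniformly, accept
# with probability min(1, pi(y)/pi(x))).
pi = np.array([0.2, 0.3, 0.5])
Q = np.zeros((3, 3))
for x in range(3):
    for y in range(3):
        if x != y:
            Q[x, y] = (1 / 3) * min(1.0, pi[y] / pi[x])
    Q[x, x] = 1.0 - Q[x].sum()  # leftover probability of staying put

# Detailed balance: pi(x) Q(y|x) = pi(y) Q(x|y), i.e. the matrix
# pi(x) Q(y|x) is symmetric in (x, y).
flow = pi[:, None] * Q
assert np.allclose(flow, flow.T)

# Detailed balance implies stationarity: sum_x pi(x) Q(y|x) = pi(y).
assert np.allclose(pi @ Q, pi)
```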

Theorem

If a Markov chain with stationary distribution $\pi$ is

  1. Irreducible: $\forall x, y\ \exists n: P(X^{(n)} = y \mid X^{(0)} = x) > 0$.
  2. Aperiodic: $\forall x,\ \gcd\{n > 0 : P(X^{(n)} = x \mid X^{(0)} = x) > 0\} = 1$,

then $\mathcal{L}(X^{(t)}) \to \pi$ regardless of $\pi_0$.

The proof is beyond the scope of these notes.
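The theorem can be illustrated numerically by iterating the marginal update $\pi_{t+1}(y) = \sum_x Q(y \mid x)\,\pi_t(x)$ from two different starting distributions. The particular transition matrix below is made up for illustration (any matrix with all-positive entries is irreducible and aperiodic):

```python
import numpy as np

# An irreducible, aperiodic chain (all entries positive); rows sum to 1.
Q = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Two very different initial distributions pi_0.
pi_a = np.array([1.0, 0.0, 0.0])
pi_b = np.array([0.0, 0.0, 1.0])

# Iterate pi_{t+1} = pi_t Q (row vector times transition matrix).
for _ in range(200):
    pi_a = pi_a @ Q
    pi_b = pi_b @ Q

# Both marginals converge to the same stationary distribution.
assert np.allclose(pi_a, pi_b)
assert np.allclose(pi_a @ Q, pi_a)
```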

So the strategy is: find $Q$ with stationary distribution $\lambda(\theta \mid X)$, start at any $\theta^{(0)}$, and run the chain for a long time; then $\theta^{(t)}$ is approximately a sample from the posterior for large $t$.

2 Gibbs Sampler

Denote the parameter vector $\theta = (\theta_1, \dots, \theta_d)$.

Gibbs Sampler

  • Initialize $\theta = \theta^{(0)}$.
  • For $t = 1, \dots, T$:
    • For $j = 1, \dots, d$:
      • Sample $\theta_j \sim \lambda(\theta_j \mid \theta_{-j}, x)$. (*)
    • Record $\theta^{(t)} = \theta$.

Here $\theta_{-j} = (\theta_1, \dots, \theta_{j-1}, \theta_{j+1}, \dots, \theta_d)$ denotes all parameters apart from $\theta_j$.
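A minimal sketch of the sampler for a toy bivariate normal posterior, where both full conditionals are available in closed form. The target distribution and its correlation $\rho$ are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior": (theta_1, theta_2) ~ N(0, [[1, rho], [rho, 1]]).
# The full conditionals are theta_j | theta_{-j} ~ N(rho * theta_{-j}, 1 - rho^2).
rho = 0.5
T = 20000
theta = np.zeros(2)          # initialize theta = theta^(0)
samples = np.empty((T, 2))
for t in range(T):
    for j in range(2):       # systematic scan over the coordinates
        other = theta[1 - j]
        theta[j] = rng.normal(rho * other, np.sqrt(1 - rho**2))
    samples[t] = theta       # record theta^(t)

# Sample moments should be close to the target's (mean 0, correlation rho).
print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```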

Variations on (*): e.g. sweeping the coordinates in random rather than fixed order (random scan), updating a block of coordinates jointly (blocked Gibbs), or replacing an intractable conditional draw with a Metropolis step (Metropolis-within-Gibbs).

Advantage for hierarchical priors: by the Markov structure of the graphical model, each full conditional $\lambda(\theta_j \mid \theta_{-j}, x)$ depends only on the neighbors of $\theta_j$, and with conditionally conjugate priors it is often available in closed form.

2.1 Stationarity of $\lambda(\theta \mid X)$

Claim

If $\theta^{(t)} \sim \lambda(\theta \mid X)$, then $\theta^{(t+1)} \sim \lambda(\theta \mid X)$. Indeed, each step (*) leaves the posterior invariant: if $(\theta_j, \theta_{-j})$ is jointly distributed according to the posterior, then resampling $\theta_j$ from its exact conditional $\lambda(\theta_j \mid \theta_{-j}, x)$ leaves the joint distribution unchanged.

3 MCMC in Practice

(Figure: trace plots of Gibbs output under different scenarios; the last shows a well-mixed chain.)
For the last (desirable) one, the posterior mean can be estimated by $\frac{1}{N+1}\sum_{k=0}^{N} \theta_j^{(B + ks)} \approx E[\theta_j \mid X]$, with burn-in $B$ and thinning stride $s$.

But this will cause Gibbs to take a long time to mix. (The figure shows the posterior concentrated on a very sharp ellipse, i.e. $\theta_1$ and $\theta_2$ are highly correlated.) A better parameterization is $\beta_1 = \theta_1 + \theta_2$, $\beta_2 = \theta_1 - \theta_2$, so that $\beta_1 \perp \beta_2 \mid X$. In these coordinates the Gibbs sampler is equivalent to sampling directly from the posterior.
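The mixing problem can be seen numerically. The sketch below (with an assumed bivariate normal target, as before) compares the lag-1 autocorrelation of a Gibbs trace on a nearly degenerate ellipse ($\rho = 0.99$) against the decorrelated parameterization ($\rho = 0$, where Gibbs draws are effectively i.i.d. from the posterior):

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_autocorr(rho, T=5000):
    """Lag-1 autocorrelation of the first coordinate's Gibbs trace
    for the target N(0, [[1, rho], [rho, 1]])."""
    theta = np.zeros(2)
    trace = np.empty(T)
    for t in range(T):
        for j in range(2):
            theta[j] = rng.normal(rho * theta[1 - j], np.sqrt(1 - rho**2))
        trace[t] = theta[0]
    return np.corrcoef(trace[:-1], trace[1:])[0, 1]

# Sharp ellipse: consecutive samples are nearly identical (slow mixing).
# After the rotation beta_1 = theta_1 + theta_2, beta_2 = theta_1 - theta_2,
# the coordinates are independent, and Gibbs mixes immediately.
slow = gibbs_autocorr(0.99)
fast = gibbs_autocorr(0.0)
print(slow, fast)  # slow is close to 1, fast is close to 0
```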

4 Empirical Bayes

Back to the Gaussian hierarchical model: $\frac{1}{d}\|X\|^2 \sim \frac{1+\tau^2}{d}\chi^2_d$, which has mean $1+\tau^2$ and variance $\frac{2(1+\tau^2)^2}{d}$, so the MLE for $1+\tau^2$ is $\frac{1}{d}\|X\|^2$.
For any "reasonable" prior, $E[\zeta \mid X] \approx \frac{d}{\|X\|^2}$ (where $\zeta = \frac{1}{1+\tau^2}$), which is close to the MLE. Hence $\hat\theta_i \approx \left(1 - \frac{d}{\|X\|^2}\right)X_i \approx (1-\zeta)X_i$.

If the prior doesn't matter much, why use one? We could just estimate $\zeta$ from the data however we like and "plug it in".

UMVUE: $\hat\zeta = \dfrac{d-2}{\|X\|^2}$.

We call this hybrid approach Empirical Bayes: the hyperparameters are treated as fixed and estimated from the data, while the remaining parameters are treated as random.
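A sketch of the plug-in recipe in the Gaussian hierarchical model above (the dimension and $\tau^2$ below are arbitrary illustrative choices): estimate $\zeta$ by its UMVUE and shrink, which yields the James-Stein estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian hierarchical model: theta_i ~ N(0, tau^2), X_i | theta_i ~ N(theta_i, 1).
d, tau2 = 500, 2.0
theta = rng.normal(0.0, np.sqrt(tau2), size=d)
X = rng.normal(theta, 1.0)

# Empirical Bayes plug-in: zeta_hat = (d - 2) / ||X||^2 (the UMVUE),
# then shrink toward 0: theta_hat = (1 - zeta_hat) X.
zeta_hat = (d - 2) / np.sum(X**2)
theta_hat = (1 - zeta_hat) * X

# The shrinkage estimator beats the MLE theta_hat = X in total squared error.
print(np.sum((theta_hat - theta)**2), np.sum((X - theta)**2))
```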